Assignment - 4

21BCE9778

Questions

1. Download the Employee Attrition Dataset
https://www.kaggle.com/datasets/patelprashant/employee-attrition
2. Perform Data Preprocessing
3. Model Building using Logistic Regression and Decision Tree
4. Calculate Performance metrics

Import the libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Importing the dataset

In [2]:
data = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
data.head(15)
Out[2]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber ... RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
0 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 1 1 ... 1 80 0 8 0 1 6 4 0 5
1 49 No Travel_Frequently 279 Research & Development 8 1 Life Sciences 1 2 ... 4 80 1 10 3 3 10 7 1 7
2 37 Yes Travel_Rarely 1373 Research & Development 2 2 Other 1 4 ... 2 80 0 7 3 3 0 0 0 0
3 33 No Travel_Frequently 1392 Research & Development 3 4 Life Sciences 1 5 ... 3 80 0 8 3 3 8 7 3 0
4 27 No Travel_Rarely 591 Research & Development 2 1 Medical 1 7 ... 4 80 1 6 3 3 2 2 2 2
5 32 No Travel_Frequently 1005 Research & Development 2 2 Life Sciences 1 8 ... 3 80 0 8 2 2 7 7 3 6
6 59 No Travel_Rarely 1324 Research & Development 3 3 Medical 1 10 ... 1 80 3 12 3 2 1 0 0 0
7 30 No Travel_Rarely 1358 Research & Development 24 1 Life Sciences 1 11 ... 2 80 1 1 2 3 1 0 0 0
8 38 No Travel_Frequently 216 Research & Development 23 3 Life Sciences 1 12 ... 2 80 0 10 2 3 9 7 1 8
9 36 No Travel_Rarely 1299 Research & Development 27 3 Medical 1 13 ... 2 80 2 17 3 2 7 7 7 7
10 35 No Travel_Rarely 809 Research & Development 16 3 Medical 1 14 ... 3 80 1 6 5 3 5 4 0 3
11 29 No Travel_Rarely 153 Research & Development 15 2 Life Sciences 1 15 ... 4 80 0 10 3 3 9 5 0 8
12 31 No Travel_Rarely 670 Research & Development 26 1 Life Sciences 1 16 ... 4 80 1 5 1 2 5 2 4 3
13 34 No Travel_Rarely 1346 Research & Development 19 2 Medical 1 18 ... 3 80 1 3 2 3 2 2 1 2
14 28 Yes Travel_Rarely 103 Research & Development 24 3 Life Sciences 1 19 ... 2 80 0 6 4 3 4 2 0 3

15 rows × 35 columns

In [3]:
data.columns
Out[3]:
Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
       'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')
In [4]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                  1470 non-null   int64 
 15  JobRole                   1470 non-null   object
 16  JobSatisfaction           1470 non-null   int64 
 17  MaritalStatus             1470 non-null   object
 18  MonthlyIncome             1470 non-null   int64 
 19  MonthlyRate               1470 non-null   int64 
 20  NumCompaniesWorked        1470 non-null   int64 
 21  Over18                    1470 non-null   object
 22  OverTime                  1470 non-null   object
 23  PercentSalaryHike         1470 non-null   int64 
 24  PerformanceRating         1470 non-null   int64 
 25  RelationshipSatisfaction  1470 non-null   int64 
 26  StandardHours             1470 non-null   int64 
 27  StockOptionLevel          1470 non-null   int64 
 28  TotalWorkingYears         1470 non-null   int64 
 29  TrainingTimesLastYear     1470 non-null   int64 
 30  WorkLifeBalance           1470 non-null   int64 
 31  YearsAtCompany            1470 non-null   int64 
 32  YearsInCurrentRole        1470 non-null   int64 
 33  YearsSinceLastPromotion   1470 non-null   int64 
 34  YearsWithCurrManager      1470 non-null   int64 
dtypes: int64(26), object(9)
memory usage: 402.1+ KB
In [5]:
data.describe()
Out[5]:
Age DailyRate DistanceFromHome Education EmployeeCount EmployeeNumber EnvironmentSatisfaction HourlyRate JobInvolvement JobLevel ... RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
count 1470.000000 1470.000000 1470.000000 1470.000000 1470.0 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 ... 1470.000000 1470.0 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000
mean 36.923810 802.485714 9.192517 2.912925 1.0 1024.865306 2.721769 65.891156 2.729932 2.063946 ... 2.712245 80.0 0.793878 11.279592 2.799320 2.761224 7.008163 4.229252 2.187755 4.123129
std 9.135373 403.509100 8.106864 1.024165 0.0 602.024335 1.093082 20.329428 0.711561 1.106940 ... 1.081209 0.0 0.852077 7.780782 1.289271 0.706476 6.126525 3.623137 3.222430 3.568136
min 18.000000 102.000000 1.000000 1.000000 1.0 1.000000 1.000000 30.000000 1.000000 1.000000 ... 1.000000 80.0 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000
25% 30.000000 465.000000 2.000000 2.000000 1.0 491.250000 2.000000 48.000000 2.000000 1.000000 ... 2.000000 80.0 0.000000 6.000000 2.000000 2.000000 3.000000 2.000000 0.000000 2.000000
50% 36.000000 802.000000 7.000000 3.000000 1.0 1020.500000 3.000000 66.000000 3.000000 2.000000 ... 3.000000 80.0 1.000000 10.000000 3.000000 3.000000 5.000000 3.000000 1.000000 3.000000
75% 43.000000 1157.000000 14.000000 4.000000 1.0 1555.750000 4.000000 83.750000 3.000000 3.000000 ... 4.000000 80.0 1.000000 15.000000 3.000000 3.000000 9.000000 7.000000 3.000000 7.000000
max 60.000000 1499.000000 29.000000 5.000000 1.0 2068.000000 4.000000 100.000000 4.000000 5.000000 ... 4.000000 80.0 3.000000 40.000000 6.000000 4.000000 40.000000 18.000000 15.000000 17.000000

8 rows × 26 columns

In [6]:
(data.describe()).columns
Out[6]:
Index(['Age', 'DailyRate', 'DistanceFromHome', 'Education', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'MonthlyIncome',
       'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike',
       'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours',
       'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
       'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
       'YearsSinceLastPromotion', 'YearsWithCurrManager'],
      dtype='object')

Checking for null values

In [6]:
data.isnull().any()
Out[6]:
Age                         False
Attrition                   False
BusinessTravel              False
DailyRate                   False
Department                  False
DistanceFromHome            False
Education                   False
EducationField              False
EmployeeCount               False
EmployeeNumber              False
EnvironmentSatisfaction     False
Gender                      False
HourlyRate                  False
JobInvolvement              False
JobLevel                    False
JobRole                     False
JobSatisfaction             False
MaritalStatus               False
MonthlyIncome               False
MonthlyRate                 False
NumCompaniesWorked          False
Over18                      False
OverTime                    False
PercentSalaryHike           False
PerformanceRating           False
RelationshipSatisfaction    False
StandardHours               False
StockOptionLevel            False
TotalWorkingYears           False
TrainingTimesLastYear       False
WorkLifeBalance             False
YearsAtCompany              False
YearsInCurrentRole          False
YearsSinceLastPromotion     False
YearsWithCurrManager        False
dtype: bool
In [7]:
data.isnull().sum()
Out[7]:
Age                         0
Attrition                   0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EmployeeCount               0
EmployeeNumber              0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
Over18                      0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StandardHours               0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSinceLastPromotion     0
YearsWithCurrManager        0
dtype: int64
In [51]:
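# Note: count() on the isna() mask returns the number of rows per column, not the number of
# missing values (sum() gives that). The 1386 below is the row count after outlier removal.
data.isna().count()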
Out[51]:
Age                         1386
Attrition                   1386
BusinessTravel              1386
DailyRate                   1386
Department                  1386
DistanceFromHome            1386
Education                   1386
EducationField              1386
EmployeeCount               1386
EmployeeNumber              1386
EnvironmentSatisfaction     1386
Gender                      1386
HourlyRate                  1386
JobInvolvement              1386
JobLevel                    1386
JobRole                     1386
JobSatisfaction             1386
MaritalStatus               1386
MonthlyIncome               1386
MonthlyRate                 1386
NumCompaniesWorked          1386
Over18                      1386
OverTime                    1386
PercentSalaryHike           1386
PerformanceRating           1386
RelationshipSatisfaction    1386
StandardHours               1386
StockOptionLevel            1386
TotalWorkingYears           1386
TrainingTimesLastYear       1386
WorkLifeBalance             1386
YearsAtCompany              1386
YearsInCurrentRole          1386
YearsSinceLastPromotion     1386
YearsWithCurrManager        1386
dtype: int64
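
The two checks above answer different questions: `isna().sum()` counts missing values, while `isna().count()` counts rows in the boolean mask. A minimal sketch of the difference on a toy frame (the column names `a` and `b` are made up for illustration and are not part of the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with exactly one missing value; 'a' and 'b' are illustrative names.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", "z"]})

rows_per_column = toy.isna().count()    # number of rows in each column of the mask
missing_per_column = toy.isna().sum()   # number of missing values in each column

print(rows_per_column.to_dict())
print(missing_per_column.to_dict())
```

Only the `sum()` variant reports how many values are actually missing.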

Data Visualization

In [8]:
data1 = data[['Age', 'DailyRate', 'DistanceFromHome', 'Education', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'MonthlyIncome',
       'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike',
       'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours',
       'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
       'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
       'YearsSinceLastPromotion', 'YearsWithCurrManager']]
corr=data1.corr()
corr
Out[8]:
Age DailyRate DistanceFromHome Education EmployeeCount EmployeeNumber EnvironmentSatisfaction HourlyRate JobInvolvement JobLevel ... RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
Age 1.000000 0.010661 -0.001686 0.208034 NaN -0.010145 0.010146 0.024287 0.029820 0.509604 ... 0.053535 NaN 0.037510 0.680381 -0.019621 -0.021490 0.311309 0.212901 0.216513 0.202089
DailyRate 0.010661 1.000000 -0.004985 -0.016806 NaN -0.050990 0.018355 0.023381 0.046135 0.002966 ... 0.007846 NaN 0.042143 0.014515 0.002453 -0.037848 -0.034055 0.009932 -0.033229 -0.026363
DistanceFromHome -0.001686 -0.004985 1.000000 0.021042 NaN 0.032916 -0.016075 0.031131 0.008783 0.005303 ... 0.006557 NaN 0.044872 0.004628 -0.036942 -0.026556 0.009508 0.018845 0.010029 0.014406
Education 0.208034 -0.016806 0.021042 1.000000 NaN 0.042070 -0.027128 0.016775 0.042438 0.101589 ... -0.009118 NaN 0.018422 0.148280 -0.025100 0.009819 0.069114 0.060236 0.054254 0.069065
EmployeeCount NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
EmployeeNumber -0.010145 -0.050990 0.032916 0.042070 NaN 1.000000 0.017621 0.035179 -0.006888 -0.018519 ... -0.069861 NaN 0.062227 -0.014365 0.023603 0.010309 -0.011240 -0.008416 -0.009019 -0.009197
EnvironmentSatisfaction 0.010146 0.018355 -0.016075 -0.027128 NaN 0.017621 1.000000 -0.049857 -0.008278 0.001212 ... 0.007665 NaN 0.003432 -0.002693 -0.019359 0.027627 0.001458 0.018007 0.016194 -0.004999
HourlyRate 0.024287 0.023381 0.031131 0.016775 NaN 0.035179 -0.049857 1.000000 0.042861 -0.027853 ... 0.001330 NaN 0.050263 -0.002334 -0.008548 -0.004607 -0.019582 -0.024106 -0.026716 -0.020123
JobInvolvement 0.029820 0.046135 0.008783 0.042438 NaN -0.006888 -0.008278 0.042861 1.000000 -0.012630 ... 0.034297 NaN 0.021523 -0.005533 -0.015338 -0.014617 -0.021355 0.008717 -0.024184 0.025976
JobLevel 0.509604 0.002966 0.005303 0.101589 NaN -0.018519 0.001212 -0.027853 -0.012630 1.000000 ... 0.021642 NaN 0.013984 0.782208 -0.018191 0.037818 0.534739 0.389447 0.353885 0.375281
JobSatisfaction -0.004892 0.030571 -0.003669 -0.011296 NaN -0.046247 -0.006784 -0.071335 -0.021476 -0.001944 ... -0.012454 NaN 0.010690 -0.020185 -0.005779 -0.019459 -0.003803 -0.002305 -0.018214 -0.027656
MonthlyIncome 0.497855 0.007707 -0.017014 0.094961 NaN -0.014829 -0.006259 -0.015794 -0.015271 0.950300 ... 0.025873 NaN 0.005408 0.772893 -0.021736 0.030683 0.514285 0.363818 0.344978 0.344079
MonthlyRate 0.028051 -0.032182 0.027473 -0.026084 NaN 0.012648 0.037600 -0.015297 -0.016322 0.039563 ... -0.004085 NaN -0.034323 0.026442 0.001467 0.007963 -0.023655 -0.012815 0.001567 -0.036746
NumCompaniesWorked 0.299635 0.038153 -0.029251 0.126317 NaN -0.001251 0.012594 0.022157 0.015012 0.142501 ... 0.052733 NaN 0.030075 0.237639 -0.066054 -0.008366 -0.118421 -0.090754 -0.036814 -0.110319
PercentSalaryHike 0.003634 0.022704 0.040235 -0.011111 NaN -0.012944 -0.031701 -0.009062 -0.017205 -0.034730 ... -0.040490 NaN 0.007528 -0.020608 -0.005221 -0.003280 -0.035991 -0.001520 -0.022154 -0.011985
PerformanceRating 0.001904 0.000473 0.027110 -0.024539 NaN -0.020359 -0.029548 -0.002172 -0.029071 -0.021222 ... -0.031351 NaN 0.003506 0.006744 -0.015579 0.002572 0.003435 0.034986 0.017896 0.022827
RelationshipSatisfaction 0.053535 0.007846 0.006557 -0.009118 NaN -0.069861 0.007665 0.001330 0.034297 0.021642 ... 1.000000 NaN -0.045952 0.024054 0.002497 0.019604 0.019367 -0.015123 0.033493 -0.000867
StandardHours NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
StockOptionLevel 0.037510 0.042143 0.044872 0.018422 NaN 0.062227 0.003432 0.050263 0.021523 0.013984 ... -0.045952 NaN 1.000000 0.010136 0.011274 0.004129 0.015058 0.050818 0.014352 0.024698
TotalWorkingYears 0.680381 0.014515 0.004628 0.148280 NaN -0.014365 -0.002693 -0.002334 -0.005533 0.782208 ... 0.024054 NaN 0.010136 1.000000 -0.035662 0.001008 0.628133 0.460365 0.404858 0.459188
TrainingTimesLastYear -0.019621 0.002453 -0.036942 -0.025100 NaN 0.023603 -0.019359 -0.008548 -0.015338 -0.018191 ... 0.002497 NaN 0.011274 -0.035662 1.000000 0.028072 0.003569 -0.005738 -0.002067 -0.004096
WorkLifeBalance -0.021490 -0.037848 -0.026556 0.009819 NaN 0.010309 0.027627 -0.004607 -0.014617 0.037818 ... 0.019604 NaN 0.004129 0.001008 0.028072 1.000000 0.012089 0.049856 0.008941 0.002759
YearsAtCompany 0.311309 -0.034055 0.009508 0.069114 NaN -0.011240 0.001458 -0.019582 -0.021355 0.534739 ... 0.019367 NaN 0.015058 0.628133 0.003569 0.012089 1.000000 0.758754 0.618409 0.769212
YearsInCurrentRole 0.212901 0.009932 0.018845 0.060236 NaN -0.008416 0.018007 -0.024106 0.008717 0.389447 ... -0.015123 NaN 0.050818 0.460365 -0.005738 0.049856 0.758754 1.000000 0.548056 0.714365
YearsSinceLastPromotion 0.216513 -0.033229 0.010029 0.054254 NaN -0.009019 0.016194 -0.026716 -0.024184 0.353885 ... 0.033493 NaN 0.014352 0.404858 -0.002067 0.008941 0.618409 0.548056 1.000000 0.510224
YearsWithCurrManager 0.202089 -0.026363 0.014406 0.069065 NaN -0.009197 -0.004999 -0.020123 0.025976 0.375281 ... -0.000867 NaN 0.024698 0.459188 -0.004096 0.002759 0.769212 0.714365 0.510224 1.000000

26 rows × 26 columns
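
The NaN rows and columns for EmployeeCount and StandardHours appear because both columns are constant (their standard deviation is 0.0 in the describe() output), so the Pearson correlation is undefined for them. A minimal sketch, on a toy frame with made-up column names, of dropping constant columns before computing the correlation matrix:

```python
import pandas as pd

# Toy frame: 'const' never varies, so its standard deviation is zero.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 9],
    "const": [80, 80, 80, 80],
})

# Columns with a single unique value cannot be correlated with anything.
constant_cols = [c for c in toy.columns if toy[c].nunique() == 1]
corr_toy = toy.drop(columns=constant_cols).corr()

print(constant_cols)
print(corr_toy.round(3))
```

Applied to this dataset, the same idea would presumably remove EmployeeCount and StandardHours and leave a heatmap without empty bands.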

In [9]:
plt.subplots(figsize=(20,15))
sns.heatmap(corr,annot=True)
Out[9]:
<Axes: >
In [10]:
sns.scatterplot(x="TotalWorkingYears",y="JobLevel",data=data1)
Out[10]:
<Axes: xlabel='TotalWorkingYears', ylabel='JobLevel'>
In [16]:
sns.pairplot(data1)
Out[16]:
<seaborn.axisgrid.PairGrid at 0x1dfabc4eda0>

Outlier Detection and removal

In [20]:
sns.boxplot(data.TotalWorkingYears)
Out[20]:
<Axes: >
In [11]:
from scipy import stats
import numpy as np
 
z = np.abs(stats.zscore(data['TotalWorkingYears']))
print(z)
0       0.421642
1       0.164511
2       0.550208
3       0.421642
4       0.678774
          ...   
1465    0.735447
1466    0.293077
1467    0.678774
1468    0.735447
1469    0.678774
Name: TotalWorkingYears, Length: 1470, dtype: float64
In [12]:
threshold = 2
print(np.where(z > threshold))
indexes = np.where(z > threshold)[0]
(array([  18,   62,   63,   85,   98,  105,  106,  126,  187,  190,  233,
        237,  263,  270,  279,  379,  401,  406,  408,  411,  424,  425,
        445,  465,  473,  477,  534,  544,  552,  561,  588,  595,  616,
        624,  627,  646,  649,  653,  677,  714,  736,  743,  749,  760,
        766,  774,  804,  851,  861,  867,  890,  894,  914,  918,  956,
        962,  966,  971,  976, 1008, 1009, 1010, 1031, 1043, 1054, 1086,
       1111, 1116, 1126, 1135, 1138, 1154, 1176, 1181, 1184, 1194, 1264,
       1268, 1301, 1303, 1331, 1374, 1377, 1401], dtype=int64),)
In [13]:
data.drop(index=indexes, inplace=True)
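
The cells above drop rows by the positions returned from np.where, which works because the frame still has its default 0..n-1 index at that point. A hedged alternative is to filter with a boolean mask, which does not depend on the index labels. The sketch below uses a small made-up array rather than the real column:

```python
import numpy as np

# Illustrative values; 30.0 is the obvious outlier.
values = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0, 30.0])
threshold = 2

# Population z-score (ddof=0), the same default that scipy.stats.zscore uses.
z_toy = np.abs((values - values.mean()) / values.std())

# Keep only the entries whose absolute z-score is within the threshold.
keep = z_toy <= threshold
filtered = values[keep]

print(filtered)
```

With a DataFrame, the same mask could be applied as `data[keep]` instead of calling `drop` with positional indices.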

Splitting Independent and Dependent Variables

In [32]:
data12=data.drop(columns=(data.describe()).columns)
data12.columns
Out[32]:
Index(['Attrition', 'BusinessTravel', 'Department', 'EducationField', 'Gender',
       'JobRole', 'MaritalStatus', 'Over18', 'OverTime'],
      dtype='object')
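
The cell above finds the non-numeric columns by dropping every column that describe() reports. A shorter route, sketched here on a toy frame with a few of the dataset's column names, is `select_dtypes`:

```python
import pandas as pd

# Three rows copied in spirit from the dataset; values are illustrative.
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "Attrition": ["Yes", "No", "Yes"],
    "OverTime": ["Yes", "No", "Yes"],
})

# Everything that is not numeric is treated as categorical.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

print(categorical_cols)
print(numeric_cols)
```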

Encoding Categorical Variables

In [15]:
from sklearn.preprocessing import LabelEncoder
In [16]:
le=LabelEncoder()
In [83]:
one_hot_encoded_data = pd.get_dummies(data, columns = ['BusinessTravel', 'Department', 'EducationField', 'Gender',
       'JobRole', 'MaritalStatus', 'Over18', 'OverTime'])
one_hot_encoded_data
Out[83]:
Age Attrition DailyRate DistanceFromHome Education EmployeeCount EmployeeNumber EnvironmentSatisfaction HourlyRate JobInvolvement ... JobRole_Research Director JobRole_Research Scientist JobRole_Sales Executive JobRole_Sales Representative MaritalStatus_Divorced MaritalStatus_Married MaritalStatus_Single Over18_Y OverTime_No OverTime_Yes
0 41 Yes 1102 1 2 1 1 2 94 3 ... False False True False False False True True False True
1 49 No 279 8 1 1 2 3 61 2 ... False True False False False True False True True False
2 37 Yes 1373 2 2 1 4 4 92 2 ... False False False False False False True True False True
3 33 No 1392 3 4 1 5 4 56 3 ... False True False False False True False True False True
4 27 No 591 2 1 1 7 1 40 3 ... False False False False False True False True True False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1465 36 No 884 23 2 1 2061 3 41 4 ... False False False False False True False True True False
1466 39 No 613 6 1 1 2062 4 42 2 ... False False False False False True False True True False
1467 27 No 155 4 3 1 2064 2 87 4 ... False False False False False True False True False True
1468 49 No 1023 2 3 1 2065 4 63 2 ... False False True False False True False True True False
1469 34 No 628 8 3 1 2068 2 82 4 ... False False False False False True False True True False

1386 rows × 56 columns

In [84]:
one_hot_encoded_data.columns
Out[84]:
Index(['Age', 'Attrition', 'DailyRate', 'DistanceFromHome', 'Education',
       'EmployeeCount', 'EmployeeNumber', 'EnvironmentSatisfaction',
       'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobSatisfaction',
       'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction',
       'StandardHours', 'StockOptionLevel', 'TotalWorkingYears',
       'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany',
       'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager',
       'BusinessTravel_Non-Travel', 'BusinessTravel_Travel_Frequently',
       'BusinessTravel_Travel_Rarely', 'Department_Human Resources',
       'Department_Research & Development', 'Department_Sales',
       'EducationField_Human Resources', 'EducationField_Life Sciences',
       'EducationField_Marketing', 'EducationField_Medical',
       'EducationField_Other', 'EducationField_Technical Degree',
       'Gender_Female', 'Gender_Male', 'JobRole_Healthcare Representative',
       'JobRole_Human Resources', 'JobRole_Laboratory Technician',
       'JobRole_Manager', 'JobRole_Manufacturing Director',
       'JobRole_Research Director', 'JobRole_Research Scientist',
       'JobRole_Sales Executive', 'JobRole_Sales Representative',
       'MaritalStatus_Divorced', 'MaritalStatus_Married',
       'MaritalStatus_Single', 'Over18_Y', 'OverTime_No', 'OverTime_Yes'],
      dtype='object')
In [85]:
one_hot_encoded_data[['BusinessTravel_Non-Travel', 'BusinessTravel_Travel_Frequently',
       'BusinessTravel_Travel_Rarely', 'Department_Human Resources',
       'Department_Research & Development', 'Department_Sales',
       'EducationField_Human Resources', 'EducationField_Life Sciences',
       'EducationField_Marketing', 'EducationField_Medical',
       'EducationField_Other', 'EducationField_Technical Degree',
       'Gender_Female', 'Gender_Male', 'JobRole_Healthcare Representative',
       'JobRole_Human Resources', 'JobRole_Laboratory Technician',
       'JobRole_Manager', 'JobRole_Manufacturing Director',
       'JobRole_Research Director', 'JobRole_Research Scientist',
       'JobRole_Sales Executive', 'JobRole_Sales Representative',
       'MaritalStatus_Divorced', 'MaritalStatus_Married',
       'MaritalStatus_Single', 'Over18_Y', 'OverTime_No', 'OverTime_Yes']] *= 1
one_hot_encoded_data 
Out[85]:
Age Attrition DailyRate DistanceFromHome Education EmployeeCount EmployeeNumber EnvironmentSatisfaction HourlyRate JobInvolvement ... JobRole_Research Director JobRole_Research Scientist JobRole_Sales Executive JobRole_Sales Representative MaritalStatus_Divorced MaritalStatus_Married MaritalStatus_Single Over18_Y OverTime_No OverTime_Yes
0 41 Yes 1102 1 2 1 1 2 94 3 ... 0 0 1 0 0 0 1 1 0 1
1 49 No 279 8 1 1 2 3 61 2 ... 0 1 0 0 0 1 0 1 1 0
2 37 Yes 1373 2 2 1 4 4 92 2 ... 0 0 0 0 0 0 1 1 0 1
3 33 No 1392 3 4 1 5 4 56 3 ... 0 1 0 0 0 1 0 1 0 1
4 27 No 591 2 1 1 7 1 40 3 ... 0 0 0 0 0 1 0 1 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1465 36 No 884 23 2 1 2061 3 41 4 ... 0 0 0 0 0 1 0 1 1 0
1466 39 No 613 6 1 1 2062 4 42 2 ... 0 0 0 0 0 1 0 1 1 0
1467 27 No 155 4 3 1 2064 2 87 4 ... 0 0 0 0 0 1 0 1 0 1
1468 49 No 1023 2 3 1 2065 4 63 2 ... 0 0 1 0 0 1 0 1 1 0
1469 34 No 628 8 3 1 2068 2 82 4 ... 0 0 0 0 0 1 0 1 1 0

1386 rows × 56 columns

In [86]:
data11=one_hot_encoded_data
In [87]:
data11.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1386 entries, 0 to 1469
Data columns (total 56 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   Age                                1386 non-null   int64 
 1   Attrition                          1386 non-null   object
 2   DailyRate                          1386 non-null   int64 
 3   DistanceFromHome                   1386 non-null   int64 
 4   Education                          1386 non-null   int64 
 5   EmployeeCount                      1386 non-null   int64 
 6   EmployeeNumber                     1386 non-null   int64 
 7   EnvironmentSatisfaction            1386 non-null   int64 
 8   HourlyRate                         1386 non-null   int64 
 9   JobInvolvement                     1386 non-null   int64 
 10  JobLevel                           1386 non-null   int64 
 11  JobSatisfaction                    1386 non-null   int64 
 12  MonthlyIncome                      1386 non-null   int64 
 13  MonthlyRate                        1386 non-null   int64 
 14  NumCompaniesWorked                 1386 non-null   int64 
 15  PercentSalaryHike                  1386 non-null   int64 
 16  PerformanceRating                  1386 non-null   int64 
 17  RelationshipSatisfaction           1386 non-null   int64 
 18  StandardHours                      1386 non-null   int64 
 19  StockOptionLevel                   1386 non-null   int64 
 20  TotalWorkingYears                  1386 non-null   int64 
 21  TrainingTimesLastYear              1386 non-null   int64 
 22  WorkLifeBalance                    1386 non-null   int64 
 23  YearsAtCompany                     1386 non-null   int64 
 24  YearsInCurrentRole                 1386 non-null   int64 
 25  YearsSinceLastPromotion            1386 non-null   int64 
 26  YearsWithCurrManager               1386 non-null   int64 
 27  BusinessTravel_Non-Travel          1386 non-null   int32 
 28  BusinessTravel_Travel_Frequently   1386 non-null   int32 
 29  BusinessTravel_Travel_Rarely       1386 non-null   int32 
 30  Department_Human Resources         1386 non-null   int32 
 31  Department_Research & Development  1386 non-null   int32 
 32  Department_Sales                   1386 non-null   int32 
 33  EducationField_Human Resources     1386 non-null   int32 
 34  EducationField_Life Sciences       1386 non-null   int32 
 35  EducationField_Marketing           1386 non-null   int32 
 36  EducationField_Medical             1386 non-null   int32 
 37  EducationField_Other               1386 non-null   int32 
 38  EducationField_Technical Degree    1386 non-null   int32 
 39  Gender_Female                      1386 non-null   int32 
 40  Gender_Male                        1386 non-null   int32 
 41  JobRole_Healthcare Representative  1386 non-null   int32 
 42  JobRole_Human Resources            1386 non-null   int32 
 43  JobRole_Laboratory Technician      1386 non-null   int32 
 44  JobRole_Manager                    1386 non-null   int32 
 45  JobRole_Manufacturing Director     1386 non-null   int32 
 46  JobRole_Research Director          1386 non-null   int32 
 47  JobRole_Research Scientist         1386 non-null   int32 
 48  JobRole_Sales Executive            1386 non-null   int32 
 49  JobRole_Sales Representative       1386 non-null   int32 
 50  MaritalStatus_Divorced             1386 non-null   int32 
 51  MaritalStatus_Married              1386 non-null   int32 
 52  MaritalStatus_Single               1386 non-null   int32 
 53  Over18_Y                           1386 non-null   int32 
 54  OverTime_No                        1386 non-null   int32 
 55  OverTime_Yes                       1386 non-null   int32 
dtypes: int32(29), int64(26), object(1)
memory usage: 460.2+ KB
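
The encoding above produces boolean dummy columns and then multiplies them by 1 to get integers, while the Attrition target stays as 'Yes'/'No' strings (the LabelEncoder instance is created but not applied). A minimal sketch of one alternative, on a toy frame: `dtype=int` yields 0/1 columns directly, and `drop_first=True` is an optional choice (not made in the notebook) that removes one redundant level per feature.

```python
import pandas as pd

# Toy frame with illustrative values.
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No"],
    "OverTime": ["Yes", "No", "Yes"],
    "Gender": ["Male", "Female", "Male"],
})

# dtype=int gives 0/1 columns; drop_first drops the first level of each feature.
encoded = pd.get_dummies(toy, columns=["OverTime", "Gender"], dtype=int, drop_first=True)

# Map the target to 1 (employee left) and 0 (employee stayed).
encoded["Attrition"] = encoded["Attrition"].map({"Yes": 1, "No": 0})

print(encoded)
```

Dropping one level per feature avoids perfectly collinear columns such as OverTime_No and OverTime_Yes, which can matter for linear models like logistic regression.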

Feature Scaling

In [90]:
y=data11['Attrition']
X=data11.drop(columns=['Attrition'])
cols = X.columns
In [91]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
In [92]:
# Pass cols directly; wrapping it in a list would create a MultiIndex for the column labels.
X = pd.DataFrame(X, columns=cols)

Splitting Data into Train and Test

In [93]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=0)
In [94]:
x_train
Out[94]:
Age DailyRate DistanceFromHome Education EmployeeCount EmployeeNumber EnvironmentSatisfaction HourlyRate JobInvolvement JobLevel ... JobRole_Research Director JobRole_Research Scientist JobRole_Sales Executive JobRole_Sales Representative MaritalStatus_Divorced MaritalStatus_Married MaritalStatus_Single Over18_Y OverTime_No OverTime_Yes
395 0.380952 0.138252 0.178571 0.25 0.0 0.268021 1.000000 0.042857 0.000000 0.00 ... 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0
113 0.285714 0.866046 0.785714 0.50 0.0 0.076439 0.000000 0.942857 0.000000 0.00 ... 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
1018 0.523810 0.410458 0.285714 0.75 0.0 0.741655 1.000000 0.800000 0.666667 0.00 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
482 0.404762 0.654728 0.178571 0.75 0.0 0.333817 0.333333 0.742857 0.000000 0.25 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
249 0.476190 0.876791 0.035714 0.25 0.0 0.174165 1.000000 0.128571 0.333333 0.25 ... 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
763 0.666667 0.217049 0.071429 0.00 0.0 0.543299 0.000000 0.314286 0.666667 0.75 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0
835 0.166667 0.897564 0.321429 0.75 0.0 0.599419 0.666667 0.385714 0.666667 0.25 ... 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
1216 0.547619 0.246418 0.142857 0.50 0.0 0.877117 0.333333 0.785714 1.000000 0.25 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
559 0.357143 0.078797 0.214286 0.50 0.0 0.394775 0.666667 0.271429 0.666667 0.50 ... 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0
684 0.142857 0.790115 0.571429 0.00 0.0 0.487663 1.000000 0.157143 0.333333 0.25 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0

970 rows × 55 columns
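
In the cells above, MinMaxScaler is fitted on all rows before the split, so the minima and maxima of the test rows influence the scaling of the training rows. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that order on random toy data; the array names are illustrative and not taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Random toy features and alternating toy labels.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 100, size=(20, 3))
y_toy = np.array([0, 1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

toy_scaler = MinMaxScaler()
X_tr_scaled = toy_scaler.fit_transform(X_tr)  # learn min/max from the training rows only
X_te_scaled = toy_scaler.transform(X_te)      # reuse those statistics on the test rows

print(X_tr_scaled.shape, X_te_scaled.shape)
```

Scaled test values may fall slightly outside [0, 1] with this order, which is expected: the test rows are treated as unseen data.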

In [95]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='liblinear', random_state=0)
# fit the model
logreg.fit(x_train, y_train)
Out[95]:
LogisticRegression(random_state=0, solver='liblinear')
In [97]:
print('Training set score: {:.4f}'.format(logreg.score(x_train, y_train)))
print('Test set score: {:.4f}'.format(logreg.score(x_test, y_test)))
Training set score: 0.8928
Test set score: 0.8678
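
The test score of 0.8678 is easier to interpret next to a majority-class baseline. The classification report later in this notebook lists 354 'No' and 62 'Yes' rows in the test split; the sketch below (variable names are illustrative) computes the accuracy of always predicting the majority class from those counts.

```python
# Support counts taken from the classification report for the test split.
support = {"No": 354, "Yes": 62}

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

print(f"Always predicting '{majority_class}' scores {baseline_accuracy:.4f}")
```

Because the classes are imbalanced, the per-class precision and recall say more about the model than accuracy alone.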
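In [99]:
# Predictions for the full dataset (train + test); the confusion matrix below is built from these.
pred_log = logreg.predict(X)
In [104]:
# Predictions for the held-out test set; the classification report below uses these.
pred_y_log = logreg.predict(x_test)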

Performance Metrics of Logistic Regression

Confusion matrix
In [100]:
from sklearn.metrics import confusion_matrix

# rows are actual classes, columns are predicted classes, labels sorted as ['No', 'Yes']
cs = confusion_matrix(y, pred_log)

# treat 'Yes' (attrition) as the positive class
tn, fp, fn, tp = cs.ravel()

print('Confusion matrix\n\n', cs)

print('\nTrue Positives(TP) = ', tp)

print('\nTrue Negatives(TN) = ', tn)

print('\nFalse Positives(FP) = ', fp)

print('\nFalse Negatives(FN) = ', fn)
Confusion matrix

 [[1117   38]
 [ 121  110]]

True Positives(TP) =  110

True Negatives(TN) =  1117

False Positives(FP) =  38

False Negatives(FN) =  121
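For reference, scikit-learn's `confusion_matrix` puts the actual classes on the rows and the predicted classes on the columns, with labels in sorted order. A minimal sketch on made-up labels (the toy arrays and variable names below are illustrative and are not taken from the attrition data) shows how the four counts can be unpacked when 'Yes' is treated as the positive class:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# made-up labels for illustration only
toy_true = np.array(['No', 'No', 'No', 'Yes', 'Yes', 'No'])
toy_pred = np.array(['No', 'Yes', 'No', 'Yes', 'No', 'No'])

# rows = actual class, columns = predicted class, in the order given by `labels`
toy_cm = confusion_matrix(toy_true, toy_pred, labels=['No', 'Yes'])

# for a 2x2 matrix, ravel() yields TN, FP, FN, TP when the second label is the positive class
toy_tn, toy_fp, toy_fn, toy_tp = toy_cm.ravel()

print(toy_cm)
print('TN =', toy_tn, 'FP =', toy_fp, 'FN =', toy_fn, 'TP =', toy_tp)
```

Passing `labels` explicitly keeps the row and column order unambiguous, which matters when deciding which cell counts as a true positive.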
Classification Report¶
In [105]:
from sklearn.metrics import classification_report

print(classification_report(y_test, pred_y_log))
              precision    recall  f1-score   support

          No       0.91      0.94      0.92       354
         Yes       0.57      0.48      0.52        62

    accuracy                           0.87       416
   macro avg       0.74      0.71      0.72       416
weighted avg       0.86      0.87      0.86       416
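
The support column shows an imbalanced test set (354 'No' versus 62 'Yes'), so it can help to compare the reported accuracy with the score of always predicting the majority class. A minimal sketch using only those two support counts; the variable names are illustrative:

```python
# support counts copied from the classification report above
support = {'No': 354, 'Yes': 62}

# always predicting the most frequent class gets every row of that class right
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / sum(support.values())

print('Majority class   : {}'.format(majority_class))
print('Baseline accuracy: {:.4f}'.format(baseline_accuracy))
```

An accuracy close to this baseline suggests looking at the per-class recall in the report rather than at accuracy alone.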

Classification Accuracy¶
In [106]:
# scikit-learn layout: rows are actual classes, columns are predicted classes,
# so with labels ['No', 'Yes'] and 'Yes' as the positive class:
TN = cs[0,0]
TP = cs[1,1]
FP = cs[0,1]
FN = cs[1,0]
In [107]:
classification_accuracy = (TP + TN) / float(TP + TN + FP + FN)

print('Classification accuracy : {0:0.4f}'.format(classification_accuracy))
Classification accuracy : 0.8853
Classification Error¶
In [108]:
classification_error = (FP + FN) / float(TP + TN + FP + FN)

print('Classification error : {0:0.4f}'.format(classification_error))
Classification error : 0.1147
Precision¶
In [109]:
precision = TP / float(TP + FP)


print('Precision : {0:0.4f}'.format(precision))
Precision : 0.7432
Recall¶
In [110]:
recall = TP / float(TP + FN)

print('Recall or Sensitivity : {0:0.4f}'.format(recall))
Recall or Sensitivity : 0.4762
True Positive Rate¶
In [111]:
true_positive_rate = TP / float(TP + FN)
print('True Positive Rate : {0:0.4f}'.format(true_positive_rate))
True Positive Rate : 0.4762
False Positive Rate¶

false_positive_rate = FP / float(FP + TN)

print('False Positive Rate : {0:0.4f}'.format(false_positive_rate))

Specificity¶
In [112]:
specificity = TN / (TN + FP)
print('Specificity : {0:0.4f}'.format(specificity))
Specificity : 0.9671
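The metric cells above all derive from the same four confusion-matrix counts. A minimal sketch that gathers the formulas in one helper; `binary_metrics` and the counts passed to it are hypothetical and chosen only for illustration:

```python
def binary_metrics(tn, fp, fn, tp):
    """Return common binary-classification metrics from the four confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        'accuracy': (tp + tn) / total,
        'error': (fp + fn) / total,
        'precision': tp / (tp + fp),            # of predicted positives, how many are real
        'recall': tp / (tp + fn),               # of real positives, how many were found
        'false_positive_rate': fp / (fp + tn),
        'specificity': tn / (tn + fp),          # equals 1 - false_positive_rate
    }

# hypothetical counts, not taken from the attrition data
demo = binary_metrics(tn=50, fp=10, fn=5, tp=35)
for name, value in demo.items():
    print('{} : {:.4f}'.format(name, value))
```

Keeping the counts as named arguments makes it harder to swap false positives and false negatives by accident.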
F1 Score¶
In [113]:
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

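# Computing the F1 score; average='micro' pools both classes, which for
# single-label classification equals the overall accuracy
f1score = f1_score(y_test, pred_y_log, average='micro')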

print("F1 score: {:.2f}".format(f1score))
F1 score: 0.87
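With `average='micro'`, the F1 score of a single-label classifier coincides with its accuracy, while a per-class F1 can be obtained with `pos_label`. A minimal sketch on made-up labels (the toy arrays are illustrative, not the attrition data) shows the difference:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# made-up labels for illustration only
toy_true = np.array(['No', 'No', 'No', 'Yes', 'Yes', 'No'])
toy_pred = np.array(['No', 'Yes', 'No', 'Yes', 'No', 'No'])

toy_accuracy = accuracy_score(toy_true, toy_pred)
toy_f1_micro = f1_score(toy_true, toy_pred, average='micro')

# F1 of the 'Yes' class alone (the default average='binary' needs pos_label for string labels)
toy_f1_yes = f1_score(toy_true, toy_pred, pos_label='Yes')

print('accuracy : {:.4f}'.format(toy_accuracy))
print('micro F1 : {:.4f}'.format(toy_f1_micro))
print('F1 (Yes) : {:.4f}'.format(toy_f1_yes))
```

On imbalanced data the per-class value for the minority class is usually the more informative of the two.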
In [115]:
from sklearn.tree import DecisionTreeClassifier
help(DecisionTreeClassifier)
Help on class DecisionTreeClassifier in module sklearn.tree._classes:

class DecisionTreeClassifier(sklearn.base.ClassifierMixin, BaseDecisionTree)
 |  DecisionTreeClassifier(*, criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, class_weight=None, ccp_alpha=0.0)
 |  
 |  A decision tree classifier.
 |  
 |  Read more in the :ref:`User Guide <tree>`.
 |  
 |  Parameters
 |  ----------
 |  criterion : {"gini", "entropy", "log_loss"}, default="gini"
 |      The function to measure the quality of a split. Supported criteria are
 |      "gini" for the Gini impurity and "log_loss" and "entropy" both for the
 |      Shannon information gain, see :ref:`tree_mathematical_formulation`.
 |  
 |  splitter : {"best", "random"}, default="best"
 |      The strategy used to choose the split at each node. Supported
 |      strategies are "best" to choose the best split and "random" to choose
 |      the best random split.
 |  
 |  max_depth : int, default=None
 |      The maximum depth of the tree. If None, then nodes are expanded until
 |      all leaves are pure or until all leaves contain less than
 |      min_samples_split samples.
 |  
 |  min_samples_split : int or float, default=2
 |      The minimum number of samples required to split an internal node:
 |  
 |      - If int, then consider `min_samples_split` as the minimum number.
 |      - If float, then `min_samples_split` is a fraction and
 |        `ceil(min_samples_split * n_samples)` are the minimum
 |        number of samples for each split.
 |  
 |      .. versionchanged:: 0.18
 |         Added float values for fractions.
 |  
 |  min_samples_leaf : int or float, default=1
 |      The minimum number of samples required to be at a leaf node.
 |      A split point at any depth will only be considered if it leaves at
 |      least ``min_samples_leaf`` training samples in each of the left and
 |      right branches.  This may have the effect of smoothing the model,
 |      especially in regression.
 |  
 |      - If int, then consider `min_samples_leaf` as the minimum number.
 |      - If float, then `min_samples_leaf` is a fraction and
 |        `ceil(min_samples_leaf * n_samples)` are the minimum
 |        number of samples for each node.
 |  
 |      .. versionchanged:: 0.18
 |         Added float values for fractions.
 |  
 |  min_weight_fraction_leaf : float, default=0.0
 |      The minimum weighted fraction of the sum total of weights (of all
 |      the input samples) required to be at a leaf node. Samples have
 |      equal weight when sample_weight is not provided.
 |  
 |  max_features : int, float or {"auto", "sqrt", "log2"}, default=None
 |      The number of features to consider when looking for the best split:
 |  
 |          - If int, then consider `max_features` features at each split.
 |          - If float, then `max_features` is a fraction and
 |            `max(1, int(max_features * n_features_in_))` features are considered at
 |            each split.
 |          - If "sqrt", then `max_features=sqrt(n_features)`.
 |          - If "log2", then `max_features=log2(n_features)`.
 |          - If None, then `max_features=n_features`.
 |  
 |      Note: the search for a split does not stop until at least one
 |      valid partition of the node samples is found, even if it requires to
 |      effectively inspect more than ``max_features`` features.
 |  
 |  random_state : int, RandomState instance or None, default=None
 |      Controls the randomness of the estimator. The features are always
 |      randomly permuted at each split, even if ``splitter`` is set to
 |      ``"best"``. When ``max_features < n_features``, the algorithm will
 |      select ``max_features`` at random at each split before finding the best
 |      split among them. But the best found split may vary across different
 |      runs, even if ``max_features=n_features``. That is the case, if the
 |      improvement of the criterion is identical for several splits and one
 |      split has to be selected at random. To obtain a deterministic behaviour
 |      during fitting, ``random_state`` has to be fixed to an integer.
 |      See :term:`Glossary <random_state>` for details.
 |  
 |  max_leaf_nodes : int, default=None
 |      Grow a tree with ``max_leaf_nodes`` in best-first fashion.
 |      Best nodes are defined as relative reduction in impurity.
 |      If None then unlimited number of leaf nodes.
 |  
 |  min_impurity_decrease : float, default=0.0
 |      A node will be split if this split induces a decrease of the impurity
 |      greater than or equal to this value.
 |  
 |      The weighted impurity decrease equation is the following::
 |  
 |          N_t / N * (impurity - N_t_R / N_t * right_impurity
 |                              - N_t_L / N_t * left_impurity)
 |  
 |      where ``N`` is the total number of samples, ``N_t`` is the number of
 |      samples at the current node, ``N_t_L`` is the number of samples in the
 |      left child, and ``N_t_R`` is the number of samples in the right child.
 |  
 |      ``N``, ``N_t``, ``N_t_R`` and ``N_t_L`` all refer to the weighted sum,
 |      if ``sample_weight`` is passed.
 |  
 |      .. versionadded:: 0.19
 |  
 |  class_weight : dict, list of dict or "balanced", default=None
 |      Weights associated with classes in the form ``{class_label: weight}``.
 |      If None, all classes are supposed to have weight one. For
 |      multi-output problems, a list of dicts can be provided in the same
 |      order as the columns of y.
 |  
 |      Note that for multioutput (including multilabel) weights should be
 |      defined for each class of every column in its own dict. For example,
 |      for four-class multilabel classification weights should be
 |      [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of
 |      [{1:1}, {2:5}, {3:1}, {4:1}].
 |  
 |      The "balanced" mode uses the values of y to automatically adjust
 |      weights inversely proportional to class frequencies in the input data
 |      as ``n_samples / (n_classes * np.bincount(y))``
 |  
 |      For multi-output, the weights of each column of y will be multiplied.
 |  
 |      Note that these weights will be multiplied with sample_weight (passed
 |      through the fit method) if sample_weight is specified.
 |  
 |  ccp_alpha : non-negative float, default=0.0
 |      Complexity parameter used for Minimal Cost-Complexity Pruning. The
 |      subtree with the largest cost complexity that is smaller than
 |      ``ccp_alpha`` will be chosen. By default, no pruning is performed. See
 |      :ref:`minimal_cost_complexity_pruning` for details.
 |  
 |      .. versionadded:: 0.22
 |  
 |  Attributes
 |  ----------
 |  classes_ : ndarray of shape (n_classes,) or list of ndarray
 |      The classes labels (single output problem),
 |      or a list of arrays of class labels (multi-output problem).
 |  
 |  feature_importances_ : ndarray of shape (n_features,)
 |      The impurity-based feature importances.
 |      The higher, the more important the feature.
 |      The importance of a feature is computed as the (normalized)
 |      total reduction of the criterion brought by that feature.  It is also
 |      known as the Gini importance [4]_.
 |  
 |      Warning: impurity-based feature importances can be misleading for
 |      high cardinality features (many unique values). See
 |      :func:`sklearn.inspection.permutation_importance` as an alternative.
 |  
 |  max_features_ : int
 |      The inferred value of max_features.
 |  
 |  n_classes_ : int or list of int
 |      The number of classes (for single output problems),
 |      or a list containing the number of classes for each
 |      output (for multi-output problems).
 |  
 |  n_features_in_ : int
 |      Number of features seen during :term:`fit`.
 |  
 |      .. versionadded:: 0.24
 |  
 |  feature_names_in_ : ndarray of shape (`n_features_in_`,)
 |      Names of features seen during :term:`fit`. Defined only when `X`
 |      has feature names that are all strings.
 |  
 |      .. versionadded:: 1.0
 |  
 |  n_outputs_ : int
 |      The number of outputs when ``fit`` is performed.
 |  
 |  tree_ : Tree instance
 |      The underlying Tree object. Please refer to
 |      ``help(sklearn.tree._tree.Tree)`` for attributes of Tree object and
 |      :ref:`sphx_glr_auto_examples_tree_plot_unveil_tree_structure.py`
 |      for basic usage of these attributes.
 |  
 |  See Also
 |  --------
 |  DecisionTreeRegressor : A decision tree regressor.
 |  
 |  Notes
 |  -----
 |  The default values for the parameters controlling the size of the trees
 |  (e.g. ``max_depth``, ``min_samples_leaf``, etc.) lead to fully grown and
 |  unpruned trees which can potentially be very large on some data sets. To
 |  reduce memory consumption, the complexity and size of the trees should be
 |  controlled by setting those parameter values.
 |  
 |  The :meth:`predict` method operates using the :func:`numpy.argmax`
 |  function on the outputs of :meth:`predict_proba`. This means that in
 |  case the highest predicted probabilities are tied, the classifier will
 |  predict the tied class with the lowest index in :term:`classes_`.
 |  
 |  References
 |  ----------
 |  
 |  .. [1] https://en.wikipedia.org/wiki/Decision_tree_learning
 |  
 |  .. [2] L. Breiman, J. Friedman, R. Olshen, and C. Stone, "Classification
 |         and Regression Trees", Wadsworth, Belmont, CA, 1984.
 |  
 |  .. [3] T. Hastie, R. Tibshirani and J. Friedman. "Elements of Statistical
 |         Learning", Springer, 2009.
 |  
 |  .. [4] L. Breiman, and A. Cutler, "Random Forests",
 |         https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
 |  
 |  Examples
 |  --------
 |  >>> from sklearn.datasets import load_iris
 |  >>> from sklearn.model_selection import cross_val_score
 |  >>> from sklearn.tree import DecisionTreeClassifier
 |  >>> clf = DecisionTreeClassifier(random_state=0)
 |  >>> iris = load_iris()
 |  >>> cross_val_score(clf, iris.data, iris.target, cv=10)
 |  ...                             # doctest: +SKIP
 |  ...
 |  array([ 1.     ,  0.93...,  0.86...,  0.93...,  0.93...,
 |          0.93...,  0.93...,  1.     ,  0.93...,  1.      ])
 |  
 |  Method resolution order:
 |      DecisionTreeClassifier
 |      sklearn.base.ClassifierMixin
 |      BaseDecisionTree
 |      sklearn.base.MultiOutputMixin
 |      sklearn.base.BaseEstimator
 |      sklearn.utils._metadata_requests._MetadataRequester
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, *, criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, class_weight=None, ccp_alpha=0.0)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  fit(self, X, y, sample_weight=None, check_input=True)
 |      Build a decision tree classifier from the training set (X, y).
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, sparse matrix} of shape (n_samples, n_features)
 |          The training input samples. Internally, it will be converted to
 |          ``dtype=np.float32`` and if a sparse matrix is provided
 |          to a sparse ``csc_matrix``.
 |      
 |      y : array-like of shape (n_samples,) or (n_samples, n_outputs)
 |          The target values (class labels) as integers or strings.
 |      
 |      sample_weight : array-like of shape (n_samples,), default=None
 |          Sample weights. If None, then samples are equally weighted. Splits
 |          that would create child nodes with net zero or negative weight are
 |          ignored while searching for a split in each node. Splits are also
 |          ignored if they would result in any single class carrying a
 |          negative weight in either child node.
 |      
 |      check_input : bool, default=True
 |          Allow to bypass several input checking.
 |          Don't use this parameter unless you know what you're doing.
 |      
 |      Returns
 |      -------
 |      self : DecisionTreeClassifier
 |          Fitted estimator.
 |  
 |  predict_log_proba(self, X)
 |      Predict class log-probabilities of the input samples X.
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, sparse matrix} of shape (n_samples, n_features)
 |          The input samples. Internally, it will be converted to
 |          ``dtype=np.float32`` and if a sparse matrix is provided
 |          to a sparse ``csr_matrix``.
 |      
 |      Returns
 |      -------
 |      proba : ndarray of shape (n_samples, n_classes) or list of n_outputs             such arrays if n_outputs > 1
 |          The class log-probabilities of the input samples. The order of the
 |          classes corresponds to that in the attribute :term:`classes_`.
 |  
 |  predict_proba(self, X, check_input=True)
 |      Predict class probabilities of the input samples X.
 |      
 |      The predicted class probability is the fraction of samples of the same
 |      class in a leaf.
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, sparse matrix} of shape (n_samples, n_features)
 |          The input samples. Internally, it will be converted to
 |          ``dtype=np.float32`` and if a sparse matrix is provided
 |          to a sparse ``csr_matrix``.
 |      
 |      check_input : bool, default=True
 |          Allow to bypass several input checking.
 |          Don't use this parameter unless you know what you're doing.
 |      
 |      Returns
 |      -------
 |      proba : ndarray of shape (n_samples, n_classes) or list of n_outputs             such arrays if n_outputs > 1
 |          The class probabilities of the input samples. The order of the
 |          classes corresponds to that in the attribute :term:`classes_`.
 |  
 |  set_fit_request(self: sklearn.tree._classes.DecisionTreeClassifier, *, check_input: Union[bool, NoneType, str] = '$UNCHANGED$', sample_weight: Union[bool, NoneType, str] = '$UNCHANGED$') -> sklearn.tree._classes.DecisionTreeClassifier
 |      Request metadata passed to the ``fit`` method.
 |      
 |      Note that this method is only relevant if
 |      ``enable_metadata_routing=True`` (see :func:`sklearn.set_config`).
 |      Please see :ref:`User Guide <metadata_routing>` on how the routing
 |      mechanism works.
 |      
 |      The options for each parameter are:
 |      
 |      - ``True``: metadata is requested, and passed to ``fit`` if provided. The request is ignored if metadata is not provided.
 |      
 |      - ``False``: metadata is not requested and the meta-estimator will not pass it to ``fit``.
 |      
 |      - ``None``: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
 |      
 |      - ``str``: metadata should be passed to the meta-estimator with this given alias instead of the original name.
 |      
 |      The default (``sklearn.utils.metadata_routing.UNCHANGED``) retains the
 |      existing request. This allows you to change the request for some
 |      parameters and not others.
 |      
 |      .. versionadded:: 1.3
 |      
 |      .. note::
 |          This method is only relevant if this estimator is used as a
 |          sub-estimator of a meta-estimator, e.g. used inside a
 |          :class:`pipeline.Pipeline`. Otherwise it has no effect.
 |      
 |      Parameters
 |      ----------
 |      check_input : str, True, False, or None,                     default=sklearn.utils.metadata_routing.UNCHANGED
 |          Metadata routing for ``check_input`` parameter in ``fit``.
 |      
 |      sample_weight : str, True, False, or None,                     default=sklearn.utils.metadata_routing.UNCHANGED
 |          Metadata routing for ``sample_weight`` parameter in ``fit``.
 |      
 |      Returns
 |      -------
 |      self : object
 |          The updated object.
 |  
 |  set_predict_proba_request(self: sklearn.tree._classes.DecisionTreeClassifier, *, check_input: Union[bool, NoneType, str] = '$UNCHANGED$') -> sklearn.tree._classes.DecisionTreeClassifier
 |      Request metadata passed to the ``predict_proba`` method.
 |      
 |      Note that this method is only relevant if
 |      ``enable_metadata_routing=True`` (see :func:`sklearn.set_config`).
 |      Please see :ref:`User Guide <metadata_routing>` on how the routing
 |      mechanism works.
 |      
 |      The options for each parameter are:
 |      
 |      - ``True``: metadata is requested, and passed to ``predict_proba`` if provided. The request is ignored if metadata is not provided.
 |      
 |      - ``False``: metadata is not requested and the meta-estimator will not pass it to ``predict_proba``.
 |      
 |      - ``None``: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
 |      
 |      - ``str``: metadata should be passed to the meta-estimator with this given alias instead of the original name.
 |      
 |      The default (``sklearn.utils.metadata_routing.UNCHANGED``) retains the
 |      existing request. This allows you to change the request for some
 |      parameters and not others.
 |      
 |      .. versionadded:: 1.3
 |      
 |      .. note::
 |          This method is only relevant if this estimator is used as a
 |          sub-estimator of a meta-estimator, e.g. used inside a
 |          :class:`pipeline.Pipeline`. Otherwise it has no effect.
 |      
 |      Parameters
 |      ----------
 |      check_input : str, True, False, or None,                     default=sklearn.utils.metadata_routing.UNCHANGED
 |          Metadata routing for ``check_input`` parameter in ``predict_proba``.
 |      
 |      Returns
 |      -------
 |      self : object
 |          The updated object.
 |  
 |  set_predict_request(self: sklearn.tree._classes.DecisionTreeClassifier, *, check_input: Union[bool, NoneType, str] = '$UNCHANGED$') -> sklearn.tree._classes.DecisionTreeClassifier
 |      Request metadata passed to the ``predict`` method.
 |      
 |      Note that this method is only relevant if
 |      ``enable_metadata_routing=True`` (see :func:`sklearn.set_config`).
 |      Please see :ref:`User Guide <metadata_routing>` on how the routing
 |      mechanism works.
 |      
 |      The options for each parameter are:
 |      
 |      - ``True``: metadata is requested, and passed to ``predict`` if provided. The request is ignored if metadata is not provided.
 |      
 |      - ``False``: metadata is not requested and the meta-estimator will not pass it to ``predict``.
 |      
 |      - ``None``: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
 |      
 |      - ``str``: metadata should be passed to the meta-estimator with this given alias instead of the original name.
 |      
 |      The default (``sklearn.utils.metadata_routing.UNCHANGED``) retains the
 |      existing request. This allows you to change the request for some
 |      parameters and not others.
 |      
 |      .. versionadded:: 1.3
 |      
 |      .. note::
 |          This method is only relevant if this estimator is used as a
 |          sub-estimator of a meta-estimator, e.g. used inside a
 |          :class:`pipeline.Pipeline`. Otherwise it has no effect.
 |      
 |      Parameters
 |      ----------
 |      check_input : str, True, False, or None,                     default=sklearn.utils.metadata_routing.UNCHANGED
 |          Metadata routing for ``check_input`` parameter in ``predict``.
 |      
 |      Returns
 |      -------
 |      self : object
 |          The updated object.
 |  
 |  set_score_request(self: sklearn.tree._classes.DecisionTreeClassifier, *, sample_weight: Union[bool, NoneType, str] = '$UNCHANGED$') -> sklearn.tree._classes.DecisionTreeClassifier
 |      Request metadata passed to the ``score`` method.
 |      
 |      Note that this method is only relevant if
 |      ``enable_metadata_routing=True`` (see :func:`sklearn.set_config`).
 |      Please see :ref:`User Guide <metadata_routing>` on how the routing
 |      mechanism works.
 |      
 |      The options for each parameter are:
 |      
 |      - ``True``: metadata is requested, and passed to ``score`` if provided. The request is ignored if metadata is not provided.
 |      
 |      - ``False``: metadata is not requested and the meta-estimator will not pass it to ``score``.
 |      
 |      - ``None``: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
 |      
 |      - ``str``: metadata should be passed to the meta-estimator with this given alias instead of the original name.
 |      
 |      The default (``sklearn.utils.metadata_routing.UNCHANGED``) retains the
 |      existing request. This allows you to change the request for some
 |      parameters and not others.
 |      
 |      .. versionadded:: 1.3
 |      
 |      .. note::
 |          This method is only relevant if this estimator is used as a
 |          sub-estimator of a meta-estimator, e.g. used inside a
 |          :class:`pipeline.Pipeline`. Otherwise it has no effect.
 |      
 |      Parameters
 |      ----------
 |      sample_weight : str, True, False, or None,                     default=sklearn.utils.metadata_routing.UNCHANGED
 |          Metadata routing for ``sample_weight`` parameter in ``score``.
 |      
 |      Returns
 |      -------
 |      self : object
 |          The updated object.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __abstractmethods__ = frozenset()
 |  
 |  __annotations__ = {'_parameter_constraints': <class 'dict'>}
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.ClassifierMixin:
 |  
 |  score(self, X, y, sample_weight=None)
 |      Return the mean accuracy on the given test data and labels.
 |      
 |      In multi-label classification, this is the subset accuracy
 |      which is a harsh metric since you require for each sample that
 |      each label set be correctly predicted.
 |      
 |      Parameters
 |      ----------
 |      X : array-like of shape (n_samples, n_features)
 |          Test samples.
 |      
 |      y : array-like of shape (n_samples,) or (n_samples, n_outputs)
 |          True labels for `X`.
 |      
 |      sample_weight : array-like of shape (n_samples,), default=None
 |          Sample weights.
 |      
 |      Returns
 |      -------
 |      score : float
 |          Mean accuracy of ``self.predict(X)`` w.r.t. `y`.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from sklearn.base.ClassifierMixin:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from BaseDecisionTree:
 |  
 |  apply(self, X, check_input=True)
 |      Return the index of the leaf that each sample is predicted as.
 |      
 |      .. versionadded:: 0.17
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, sparse matrix} of shape (n_samples, n_features)
 |          The input samples. Internally, it will be converted to
 |          ``dtype=np.float32`` and if a sparse matrix is provided
 |          to a sparse ``csr_matrix``.
 |      
 |      check_input : bool, default=True
 |          Allow to bypass several input checking.
 |          Don't use this parameter unless you know what you're doing.
 |      
 |      Returns
 |      -------
 |      X_leaves : array-like of shape (n_samples,)
 |          For each datapoint x in X, return the index of the leaf x
 |          ends up in. Leaves are numbered within
 |          ``[0; self.tree_.node_count)``, possibly with gaps in the
 |          numbering.
 |  
 |  cost_complexity_pruning_path(self, X, y, sample_weight=None)
 |      Compute the pruning path during Minimal Cost-Complexity Pruning.
 |      
 |      See :ref:`minimal_cost_complexity_pruning` for details on the pruning
 |      process.
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, sparse matrix} of shape (n_samples, n_features)
 |          The training input samples. Internally, it will be converted to
 |          ``dtype=np.float32`` and if a sparse matrix is provided
 |          to a sparse ``csc_matrix``.
 |      
 |      y : array-like of shape (n_samples,) or (n_samples, n_outputs)
 |          The target values (class labels) as integers or strings.
 |      
 |      sample_weight : array-like of shape (n_samples,), default=None
 |          Sample weights. If None, then samples are equally weighted. Splits
 |          that would create child nodes with net zero or negative weight are
 |          ignored while searching for a split in each node. Splits are also
 |          ignored if they would result in any single class carrying a
 |          negative weight in either child node.
 |      
 |      Returns
 |      -------
 |      ccp_path : :class:`~sklearn.utils.Bunch`
 |          Dictionary-like object, with the following attributes.
 |      
 |          ccp_alphas : ndarray
 |              Effective alphas of subtree during pruning.
 |      
 |          impurities : ndarray
 |              Sum of the impurities of the subtree leaves for the
 |              corresponding alpha value in ``ccp_alphas``.
 |  
 |  decision_path(self, X, check_input=True)
 |      Return the decision path in the tree.
 |      
 |      .. versionadded:: 0.18
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, sparse matrix} of shape (n_samples, n_features)
 |          The input samples. Internally, it will be converted to
 |          ``dtype=np.float32`` and if a sparse matrix is provided
 |          to a sparse ``csr_matrix``.
 |      
 |      check_input : bool, default=True
 |          Allow to bypass several input checking.
 |          Don't use this parameter unless you know what you're doing.
 |      
 |      Returns
 |      -------
 |      indicator : sparse matrix of shape (n_samples, n_nodes)
 |          Return a node indicator CSR matrix where non zero elements
 |          indicates that the samples goes through the nodes.
 |  
 |  get_depth(self)
 |      Return the depth of the decision tree.
 |      
 |      The depth of a tree is the maximum distance between the root
 |      and any leaf.
 |      
 |      Returns
 |      -------
 |      self.tree_.max_depth : int
 |          The maximum depth of the tree.
 |  
 |  get_n_leaves(self)
 |      Return the number of leaves of the decision tree.
 |      
 |      Returns
 |      -------
 |      self.tree_.n_leaves : int
 |          Number of leaves.
 |  
 |  predict(self, X, check_input=True)
 |      Predict class or regression value for X.
 |      
 |      For a classification model, the predicted class for each sample in X is
 |      returned. For a regression model, the predicted value based on X is
 |      returned.
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, sparse matrix} of shape (n_samples, n_features)
 |          The input samples. Internally, it will be converted to
 |          ``dtype=np.float32`` and if a sparse matrix is provided
 |          to a sparse ``csr_matrix``.
 |      
 |      check_input : bool, default=True
 |          Allow to bypass several input checking.
 |          Don't use this parameter unless you know what you're doing.
 |      
 |      Returns
 |      -------
 |      y : array-like of shape (n_samples,) or (n_samples, n_outputs)
 |          The predicted classes, or the predict values.
 |  
 |  ----------------------------------------------------------------------
 |  Readonly properties inherited from BaseDecisionTree:
 |  
 |  feature_importances_
 |      Return the feature importances.
 |      
 |      The importance of a feature is computed as the (normalized) total
 |      reduction of the criterion brought by that feature.
 |      It is also known as the Gini importance.
 |      
 |      Warning: impurity-based feature importances can be misleading for
 |      high cardinality features (many unique values). See
 |      :func:`sklearn.inspection.permutation_importance` as an alternative.
 |      
 |      Returns
 |      -------
 |      feature_importances_ : ndarray of shape (n_features,)
 |          Normalized total reduction of criteria by feature
 |          (Gini importance).
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.BaseEstimator:
 |  
 |  __getstate__(self)
 |      Helper for pickle.
 |  
 |  __repr__(self, N_CHAR_MAX=700)
 |      Return repr(self).
 |  
 |  __setstate__(self, state)
 |  
 |  __sklearn_clone__(self)
 |  
 |  get_params(self, deep=True)
 |      Get parameters for this estimator.
 |      
 |      Parameters
 |      ----------
 |      deep : bool, default=True
 |          If True, will return the parameters for this estimator and
 |          contained subobjects that are estimators.
 |      
 |      Returns
 |      -------
 |      params : dict
 |          Parameter names mapped to their values.
 |  
 |  set_params(self, **params)
 |      Set the parameters of this estimator.
 |      
 |      The method works on simple estimators as well as on nested objects
 |      (such as :class:`~sklearn.pipeline.Pipeline`). The latter have
 |      parameters of the form ``<component>__<parameter>`` so that it's
 |      possible to update each component of a nested object.
 |      
 |      Parameters
 |      ----------
 |      **params : dict
 |          Estimator parameters.
 |      
 |      Returns
 |      -------
 |      self : estimator instance
 |          Estimator instance.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.utils._metadata_requests._MetadataRequester:
 |  
 |  get_metadata_routing(self)
 |      Get metadata routing of this object.
 |      
 |      Please check :ref:`User Guide <metadata_routing>` on how the routing
 |      mechanism works.
 |      
 |      Returns
 |      -------
 |      routing : MetadataRequest
 |          A :class:`~utils.metadata_routing.MetadataRequest` encapsulating
 |          routing information.
 |  
 |  ----------------------------------------------------------------------
 |  Class methods inherited from sklearn.utils._metadata_requests._MetadataRequester:
 |  
 |  __init_subclass__(**kwargs) from abc.ABCMeta
 |      Set the ``set_{method}_request`` methods.
 |      
 |      This uses PEP-487 [1]_ to set the ``set_{method}_request`` methods. It
 |      looks for the information available in the set default values which are
 |      set using ``__metadata_request__*`` class attributes, or inferred
 |      from method signatures.
 |      
 |      The ``__metadata_request__*`` class attributes are used when a method
 |      does not explicitly accept a metadata through its arguments or if the
 |      developer would like to specify a request value for those metadata
 |      which are different from the default ``None``.
 |      
 |      References
 |      ----------
 |      .. [1] https://www.python.org/dev/peps/pep-0487

In [116]:
model11 = DecisionTreeClassifier(criterion='entropy', max_depth=5, min_samples_split=2)
model11.fit(x_train,y_train)
Out[116]:
DecisionTreeClassifier(criterion='entropy', max_depth=5)
In [117]:
print('Training set score: {:.4f}'.format(model11.score(x_train, y_train)))
print('Test set score: {:.4f}'.format(model11.score(x_test, y_test)))
Training set score: 0.8887
Test set score: 0.8149
In [122]:
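# Note: X is the full feature matrix, so these predictions cover the
# training rows as well as the held-out test rows.
pred_dtc = model11.predict(X)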
In [120]:
# Predict with the decision tree (model11), not the logistic regression model.
pred_y_dtc = model11.predict(x_test)

Performance Metrics of Decision Tree Classifier¶

Confusion Matrix¶
In [124]:
from sklearn.metrics import confusion_matrix

# Binary confusion matrix: rows are true labels, columns are predicted labels,
# both in sorted label order. y and pred_dtc cover the full dataset.
cs = confusion_matrix(y, pred_dtc)

print('Confusion matrix\n\n', cs)

# Index single cells of the 2x2 matrix so each count prints as a scalar.
print('\nTrue Positives(TP) = ', cs[0,0])

print('\nTrue Negatives(TN) = ', cs[1,1])

print('\nFalse Positives(FP) = ', cs[0,1])

print('\nFalse Negatives(FN) = ', cs[1,0])
Confusion matrix

 [[1102   53]
 [ 132   99]]

True Positives(TP) =  1102

True Negatives(TN) =  99

False Positives(FP) =  53

False Negatives(FN) =  132
In [125]:
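# Caution: scikit-learn's confusion_matrix puts true labels on rows and
# predicted labels on columns, in sorted label order. For a positive class
# encoded as 1, the usual unpacking is TN, FP, FN, TP = cs.ravel().
# The assignment below labels cell [0,0] as TP instead, so the figures that
# follow are not the standard metrics for label 1.
TP = cs[0,0]
TN = cs[1,1]
FP = cs[0,1]
FN = cs[1,0]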
In [126]:
classification_accuracy = (TP + TN) / float(TP + TN + FP + FN)

print('Classification accuracy : {0:0.4f}'.format(classification_accuracy))
Classification accuracy : 0.8665
In [127]:
classification_error = (FP + FN) / float(TP + TN + FP + FN)

print('Classification error : {0:0.4f}'.format(classification_error))
Classification error : 0.1335
In [128]:
precision = TP / float(TP + FP)


print('Precision : {0:0.4f}'.format(precision))
Precision : 0.9541
In [129]:
recall = TP / float(TP + FN)

print('Recall or Sensitivity : {0:0.4f}'.format(recall))
Recall or Sensitivity : 0.8930
In [130]:
true_positive_rate = TP / float(TP + FN)
print('True Positive Rate : {0:0.4f}'.format(true_positive_rate))
True Positive Rate : 0.8930
In [131]:
false_positive_rate = FP / float(FP + TN)


print('False Positive Rate : {0:0.4f}'.format(false_positive_rate))
False Positive Rate : 0.3487
In [132]:
specificity = TN / (TN + FP)
print('Specificity : {0:0.4f}'.format(specificity))
Specificity : 0.6513
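
For comparison, here is a minimal sketch of the same metrics under scikit-learn's layout, where rows are true labels, columns are predicted labels, and label 1 is the positive class. It uses only the four counts printed in the confusion matrix above. The helper name `binary_metrics` is hypothetical, and reading label 1 as "Attrition = Yes" is an assumption, since the encoding step is not shown in this part of the notebook.

```python
# Counts copied from the confusion matrix printed above
# (rows = true label, columns = predicted label, labels in sorted order).
cs = [[1102, 53],
      [132, 99]]


def binary_metrics(matrix):
    """Metrics for positive class 1 (hypothetical helper, plain Python)."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),            # of predicted 1s, share that are truly 1
        "recall": tp / (tp + fn),               # of true 1s, share that were found
        "specificity": tn / (tn + fp),          # of true 0s, share predicted as 0
        "false_positive_rate": fp / (fp + tn),  # 1 - specificity
    }


metrics = binary_metrics(cs)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under this reading, accuracy is unchanged, but recall for label 1 is computed as 99 / (99 + 132), which is considerably lower than the recall reported above.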
In [133]:
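from sklearn.metrics import f1_score

# Computing the F1 score of the decision tree on the test set.
# With average='micro' on a single-label problem, F1 equals plain accuracy,
# so this matches the test set score of 0.8149 reported above.
f1score = f1_score(y_test, model11.predict(x_test), average='micro')

print("F1 score: {:.2f}".format(f1score))
F1 score: 0.81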
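
A micro-averaged F1 score pools the counts of both classes, so on an imbalanced target it can hide how the minority class fares. The sketch below works out per-class, macro, and micro F1 in plain Python from the full-dataset confusion matrix printed earlier (rows = true, columns = predicted). The helper `f1_for_class` is a hypothetical name, and these figures describe the full dataset, not the test split.

```python
# Counts copied from the confusion matrix printed earlier in this section.
cs = [[1102, 53],
      [132, 99]]
n_classes = len(cs)


def class_counts(matrix, k):
    """Return (tp, fp, fn) when class k is treated as the positive class."""
    tp = matrix[k][k]
    fp = sum(matrix[i][k] for i in range(len(matrix)) if i != k)  # predicted k, truly other
    fn = sum(matrix[k][j] for j in range(len(matrix)) if j != k)  # truly k, predicted other
    return tp, fp, fn


def f1_for_class(matrix, k):
    """F1 for class k (hypothetical helper): 2TP / (2TP + FP + FN)."""
    tp, fp, fn = class_counts(matrix, k)
    return 2 * tp / (2 * tp + fp + fn)


per_class_f1 = [f1_for_class(cs, k) for k in range(n_classes)]
macro_f1 = sum(per_class_f1) / n_classes

# Micro averaging pools TP, FP and FN over all classes before applying the formula.
pooled = [class_counts(cs, k) for k in range(n_classes)]
tp_sum = sum(c[0] for c in pooled)
fp_sum = sum(c[1] for c in pooled)
fn_sum = sum(c[2] for c in pooled)
micro_f1 = 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)

accuracy = sum(cs[k][k] for k in range(n_classes)) / sum(sum(row) for row in cs)

print("per-class F1:", [round(v, 4) for v in per_class_f1])
print("macro F1: {:.4f}".format(macro_f1))
print("micro F1: {:.4f} (accuracy: {:.4f})".format(micro_f1, accuracy))
```

Because every misclassified row counts once as a false positive and once as a false negative, the pooled formula reduces to accuracy. The per-class or macro values are the ones that show how well the smaller class is predicted.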